ABSTRACT

In this paper we describe the linguistic processor 
of a spoken dialogue system. The parser receives a 
word graph from the recognition module as its input. 
Its task is to find the best path through the graph. If 
no complete solution can be found, a robust mecha- 
nism for selecting multiple partial results is applied. 
We show how the information content rate of the re- 
sults can be improved if the selection is based on an 
integrated quality score combining word recognition 
scores and context-dependent semantic predictions. 
Results of parsing word graphs with and without pre- 
dictions are reported. 

1. INTRODUCTION 

The linguistic processing (LP) component of a spo- 
ken dialogue system (SDS) must be robust in order to 
deal with recognition errors and spontaneous speech 
phenomena. 

In the following we describe our approach towards 
robustness. This LP was developed in the project 
SYSLID (SYntactic and Semantic Linguistic Pro- 
cessing for Spoken Dialogue Systems). It is fully in- 
tegrated into the Daimler-Benz SDS [3] for German 
train timetable inquiries. The architecture of this system
is shown in Figure 1.

It has been pointed out in [5] that a robust parser 
which may deliver multiple partial results has to cope 
with the problem of deciding which partial results 
should be selected. The solution suggested in [5] re- 
lies on the assignment of a quality score to each par- 
tial solution generated during parsing by means of a 
scoring function which integrates acoustic, syntactic, 
and semantic quality measures. 

In the present paper we give a more detailed de- 
scription of the implementation of this approach com- 
bining probabilistic and symbolic knowledge. First, 
we will illustrate why both contextual knowledge and 
recognition scores are important for flexible robust 
parsing. Next, the processing of these knowledge 
sources in the parser is described. Finally, we evaluate
this approach by comparing the information content
rates of analysis results that were produced with and
without semantic predictions.

Fig. 1: System architecture

2. CONTEXTUAL KNOWLEDGE AND 
RECOGNITION SCORE 

The parser receives a word graph as its input. The 
nodes of the graph represent points in time and the 
edges are labelled with scored word hypotheses. 

Figure 2 shows a simplified word graph that contains
three alternative one-word sentence hypotheses.
Scores are positive numbers which assign a
(pseudo-)probability measure to a word hypothesis: 
The smaller the score the higher the probability, i.e., 
in the example graph the hypothesis [1 er 22.08 2] 
has the best score (22.08). If one adopts the tradi- 
tional view that it is the task of the parser to find the 
best scoring interpretation, then we would expect the 
parser to deliver er (he) as solution. 



Fig. 2: A simple word graph

Fig. 3: Word graph example (edge labels recoverable from the figure: er 22.08, um 32.07, zehn 28.23, uhr 27.83)


Now let us assume the following dialogue context: 
user1: Ich möchte morgen nach Ulm fahren.
       I want to go to Ulm tomorrow.
system1: Sie wollen nach Ulm fahren?
         You want to go to Ulm?

Being a yes-no question, the last system utterance 
generates the expectation that the interjections ja 
(yes) or nein (no) will be contained in the user reply. 
When the graph in Figure 2 is analyzed in this dialogue
context, it is much more likely that the hypothesis [1
ja 31.25 2] is the correct solution, although it does
not have the best score.

Contextual expectations are mapped onto semantic
predictions which are passed down to the LP in
our system (cf. Figure 1). The predictions are generated
on the basis of the last system utterance in
a way similar to the dynamic prediction mechanism
described in [1]. One possible way of using these predictions
is to filter out all results which are incompatible
with the predictions. This strategy would have
the desired effect in the above example, but would lead to
a very restricted dialogue, because all additional user
information would be eliminated by this rigid filter. For
example, in the context above a user might be over- 
informative and instead of simply confirming might 
answer: 

user2: Ja um zehn Uhr. 

Yes at ten o'clock. 

Therefore, we prefer a less rigid strategy: If the 
semantic content of a (partial) result agrees with a 
top-down prediction, the result has a high pragmatic 
relevance, otherwise low. Pragmatic relevance is ex- 
pressed as a numerical value which can be used to 
calculate a quality score integrating scores of different 
processing levels. The basic idea is thus to increase
the overall score for predicted hypotheses in order to
compensate for lower recognition scores.

Assume, for example, that the word graph in Figure 3
was generated by the recognizer for
utterance user2. In the context of a yes-no
question we would like to increase the overall score of
the predicted hypothesis ja in the first part of the utterance.
But the overinformative second part of the
utterance, um zehn Uhr, should still be acceptable
as a parse result, although it may not correspond to
context expectations. The dialogue component of our
system is flexible enough to interpret such additional
information (cf. [4]).

3. AN INTEGRATED QUALITY SCORE 
FOR CHART EDGES 

We use a chart-based island parser implemented in 
Prolog which looks for the best scored, grammatically 
correct sentence hypothesis in the graph. It performs 
an agenda-driven heuristic search (cf. [5]). The chart
of the parser is initialized with the word hypotheses 
of the input graph. The linguistic knowledge base of 
the parser is a highly lexicalized Unification Catego- 
rial Grammar (UCG) represented in DATR (cf. [2]). 
In UCG, syntactic and semantic structures are represented
and constructed in an integrated way. Example (1)
shows the lexical entry of ja, which has the
syntactic category part (particle) and the semantic
type dm_marker (dialogue manager marker).

mor : [ form : ja ]
syn : [ head : [ major : part ] ]            (1)
sem : [ type : dm_marker, value : yes ]

Semantic predictions are provided from the dialogue 
manager in a format compatible with the semantic 
representations of lexical entries, e.g., the dialogue 
context "yes-no question" generates the prediction 
list shown in (2).

[ type : dm_marker, value : yes ]
[ type : dm_marker, value : no ]             (2)

The semantic attribute-value pairs of both lexical entries
and predictions are compiled into Prolog terms
with the same program. Thus, agreement of a chart 
edge with semantic predictions can be checked with 
standard Prolog unification. 
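
The agreement check can be illustrated outside Prolog. The following is a minimal Python sketch, not the system's implementation: it assumes flat attribute-value structures with atomic values, whereas Prolog unification also handles variables and nested terms. The names `unifies` and `agrees_with_predictions` are hypothetical and chosen only for this illustration.

```python
# Simplified stand-in for Prolog unification on flat attribute-value pairs.
# Assumption: values are atomic; there are no variables or nested structures.

def unifies(sem, prediction):
    """Two flat structures are compatible if they agree on every shared attribute."""
    return all(sem[attr] == value
               for attr, value in prediction.items()
               if attr in sem)


def agrees_with_predictions(sem, predictions):
    """True if the semantic structure is compatible with at least one prediction."""
    return any(unifies(sem, p) for p in predictions)


# Prediction list (2) for the dialogue context "yes-no question".
PREDICTIONS_YES_NO = [
    {"type": "dm_marker", "value": "yes"},
    {"type": "dm_marker", "value": "no"},
]

ja_sem = {"type": "dm_marker", "value": "yes"}   # semantics of lexical entry (1)
time_sem = {"type": "time"}                      # a time expression, not predicted

ja_agrees = agrees_with_predictions(ja_sem, PREDICTIONS_YES_NO)
time_agrees = agrees_with_predictions(time_sem, PREDICTIONS_YES_NO)
print(ja_agrees, time_agrees)
```

In this sketch the entry for ja is compatible with the first prediction, while a structure of type time conflicts with both predictions on the shared attribute type.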

The predictions are used by the parser in two ways:
First, they serve as seed definitions for the island
parser, which can thus start its search from pragmatically
relevant islands. Second, the predictions
contribute to the integrated quality score which is
assigned to each partial result during parsing. In order
to integrate the symbolic contextual knowledge
with the numerical recognition score we use a function
pr(E) which maps the agreement with a prediction
onto a numerical value.^1 In our current experiments
we use the following heuristic weightings: pr(E) = 4
if the semantic type of a chart edge E unifies with one
of the top-down predictions, otherwise pr(E) = 1.

The computation of the integrated quality score 
QS of an edge E is defined as follows: 



$$QS(E) = \frac{Q_a(E)}{sc(E) \times pr(E)} \tag{3}$$



where Q_a denotes the acoustic quality, sc the value
for syntactic completeness^2, and pr the value for pragmatic
relevance. The interpretation of QS is like that
of the recognition score, i.e., the smaller the better.
The acoustic quality Q_a is given by:

$$Q_a(E) = \frac{sf(E)}{\mathrm{length}(E)} \tag{4}$$



The shortfall function sf(E) (cf. [9, p. 298]) for a 
given edge E(i,j) that covers a segment from node i 
to node j is given by 

$$sf(E) = Maxseg - maxseg(i, j) + RS(E) \tag{5}$$

where Maxseg is the maximum (i.e., best) total score of the
whole graph, maxseg(i, j) is the maximum (i.e., best) score of
the segment i to j, and RS(E) is the recognition
score. For example, the shortfall of the hypothesis [1
ja 31.25 2] in Figure 3 is 110.21 - 22.08 + 31.25 =
119.38, which reflects the fact that a complete solution
including this hypothesis is 9.17 points worse
than the best scoring path [er um zehn uhr].

The RS of a combined edge CE, which was composed
of two edges E_1 and E_2, is defined as the sum
of the recognition scores of E_1 and E_2.
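
The shortfall computation in (5) can be sketched in Python as follows. This is an illustration under stated assumptions, not the Prolog implementation: the edge list contains only the hypotheses whose scores are stated in the text or are recoverable from Figure 3, node numbers are assumed to increase with time, and the function names are hypothetical.

```python
from collections import defaultdict

# Word graph edges as (start_node, end_node, word, recognition_score).
# Smaller scores are better. Only the edges with known scores are listed;
# the real graph in Figure 3 contains further competing hypotheses.
EDGES = [
    (1, 2, "er", 22.08),
    (1, 2, "ja", 31.25),
    (2, 3, "um", 32.07),
    (3, 4, "zehn", 28.23),
    (4, 5, "uhr", 27.83),
]


def best_segment_score(edges, i, j):
    """Best (smallest) summed recognition score of any path from node i to node j."""
    best = defaultdict(lambda: float("inf"))
    best[i] = 0.0
    # Nodes are points in time, so sorting edges by start node
    # yields a valid processing order for this acyclic graph.
    for start, end, _word, score in sorted(edges):
        if best[start] + score < best[end]:
            best[end] = best[start] + score
    return best[j]


def shortfall(edges, i, j, rs):
    """sf(E) = Maxseg - maxseg(i, j) + RS(E) for an edge E spanning nodes i to j."""
    first = min(e[0] for e in edges)
    last = max(e[1] for e in edges)
    max_seg_total = best_segment_score(edges, first, last)  # Maxseg
    return max_seg_total - best_segment_score(edges, i, j) + rs


max_seg_total = best_segment_score(EDGES, 1, 5)
sf_ja = shortfall(EDGES, 1, 2, 31.25)
sf_er = shortfall(EDGES, 1, 2, 22.08)
print(round(max_seg_total, 2), round(sf_ja, 2), round(sf_er, 2))
```

For an edge that lies on the best path, such as er, the shortfall equals Maxseg. For ja it exceeds Maxseg by the 9.17 points mentioned above.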

Given these definitions, we can now illustrate the
effect of semantic predictions on parsing the word
graph in Figure 3. Assume the graph is parsed
as an answer to a yes-no question, i.e., the prediction
list given in (2) is used. Only one hypothesis,
ja, unifies with one of the predictions,
[type: dm_marker, value: yes], i.e., pr(ja) = 4.
Thus, its quality score is 119.38 / 4 = 29.84, whereas the
scores of the alternative hypotheses spanning from
node 1 to 2 stay equal to the acoustic quality due to
pr(E) = 1. Let us further assume that the grammar
allows building a prepositional time phrase um zehn
uhr. Since no time expression is predicted, the overall
quality score of this phrase is equal to the acoustic
quality Q_a, i.e., (110.21 - 88.76 + 88.76) / 3 = 36.74.

^1 A similar score called pragmatic priority was also used in
the EVAR system (cf. [7]).

^2 For the sake of simplicity we do not consider syntactic completeness
in the following examples by setting sc(E) = 1 for all edges E.
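
The scoring in (3) and (4) can be reproduced with a few lines of Python. This is a sketch under assumptions: length(E) is taken to be the number of words covered by the edge (which matches the division by 3 for um zehn uhr), sc(E) is fixed to 1 as in the examples, and the function names are hypothetical.

```python
def pragmatic_relevance(agrees_with_prediction):
    """Heuristic weighting pr(E): 4 if the edge agrees with a prediction, else 1."""
    return 4.0 if agrees_with_prediction else 1.0


def acoustic_quality(sf, length):
    """Q_a(E) = sf(E) / length(E); length is assumed to be the number of words."""
    return sf / length


def quality_score(sf, length, agrees_with_prediction, sc=1.0):
    """QS(E) = Q_a(E) / (sc(E) * pr(E)); smaller values are better."""
    return acoustic_quality(sf, length) / (sc * pragmatic_relevance(agrees_with_prediction))


qs_ja = quality_score(119.38, 1, True)               # predicted by the yes-no context
qs_ja_no_pred = quality_score(119.38, 1, False)      # same edge without predictions
qs_er = quality_score(110.21, 1, False)              # best acoustic one-word hypothesis
qs_um_zehn_uhr = quality_score(110.21, 3, False)     # time phrase, not predicted

print(qs_ja, qs_ja_no_pred, qs_er, qs_um_zehn_uhr)
```

Without predictions, er has a better (smaller) score than ja. With the prediction applied, the score of ja drops to about 29.84 and it becomes the best edge between nodes 1 and 2.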

Under the assumption that the grammar does not 
contain a rule to combine ja and um zehn uhr, the 
parser will terminate without having found a com- 
plete solution that spans the whole input. In this 
case, the robust mechanism of selecting multiple par- 
tial results is applied: Starting from the edge with the 
best quality score, the best scoring adjacent edges are 
collected recursively until a sequence of partial results 
spanning the whole utterance is found. In our example,
the predicted result ja has obtained the best quality
score during parsing. Since it is located at the beginning
of the graph, no left-adjacent solutions have to
be looked for. Among the right-adjacent edges um,
uns, und, um zwei uhr and um zehn uhr, the latter
has the best score and is selected. Its end node
marks the end of the graph, too. Thus a sequence
of partial results through the graph was found and
the LP hands over these results as Semantic Interface
Language (SIL, cf. [6]) structures to the dialogue
manager (cf. Figure 1). Examples (6) and (7) show
the selected parsing results in SIL format.
syn : [ id : B, category : part, string : ja, score : 29.84 ]
sem : [ id : B, type : dm_marker, value : yes ]                  (6)

id : C
syn : [ id : D, category : prep, string : um_zehn_uhr, score : 36.74 ]
sem : [ id : D, type : time,
        thehour : [ id : E, type : hour, value : 10 ] ]          (7)
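
The selection of partial results described above can be sketched as follows. This is an illustrative Python version, not the system's Prolog code: it uses loops instead of recursion, and all quality scores except those of ja (29.84) and um zehn uhr (36.74) are hypothetical values chosen only so that the example runs.

```python
def select_partial_results(edges, first_node, last_node):
    """Greedy selection of a sequence of partial results.

    edges: list of (start_node, end_node, label, quality_score);
    smaller quality scores are better.
    """
    seed = min(edges, key=lambda e: e[3])   # edge with the best quality score
    sequence = [seed]

    # Collect the best-scoring left-adjacent edges until the graph start is reached.
    while sequence[0][0] != first_node:
        candidates = [e for e in edges if e[1] == sequence[0][0]]
        if not candidates:
            break
        sequence.insert(0, min(candidates, key=lambda e: e[3]))

    # Collect the best-scoring right-adjacent edges until the graph end is reached.
    while sequence[-1][1] != last_node:
        candidates = [e for e in edges if e[0] == sequence[-1][1]]
        if not candidates:
            break
        sequence.append(min(candidates, key=lambda e: e[3]))

    return sequence


CHART_EDGES = [
    (1, 2, "er", 110.21),          # hypothetical score
    (1, 2, "ja", 29.84),           # from the text
    (2, 3, "um", 110.21),          # hypothetical score
    (2, 3, "uns", 112.50),         # hypothetical score
    (2, 3, "und", 115.00),         # hypothetical score
    (2, 5, "um zwei uhr", 38.90),  # hypothetical score
    (2, 5, "um zehn uhr", 36.74),  # from the text
]

selected = select_partial_results(CHART_EDGES, 1, 5)
labels = [edge[2] for edge in selected]
print(labels)
```

With these scores the seed is ja, no left extension is needed, and um zehn uhr is chosen as the best right-adjacent edge, which ends at the final node of the graph.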



4. EVALUATION 

The main task of a parser in a speech understanding
system is to determine the meaning of the spoken
utterance. It has been argued in [8] that the
sentence understanding capabilities of an SDS are best
judged by the information content (IC) metric. IC
calculates the percentage of task-relevant information
(TRI) contained in the parser output. This requires
the annotation of each utterance with a series of
attribute-value pairs, where each attribute is a task-relevant
concept (TRC). In the present domain of
timetable inquiries, examples of TRCs are: sourcecity,
goalcity, time, date, dm_marker. For example,
the TRI of the utterance ja um 10 Uhr is
[dm_marker: yes, time: 10]. This reference annotation
is called RTRI.

IC can then be calculated by comparing RTRI 
with the parser output. For that purpose the SIL 
structures produced by the parser are translated into 
attribute-value pairs compatible with the ones of 
RTRI, e.g., the structure shown in (7) is mapped to
[time: 10]. The output of this translation is called 
PTRI. 

Performance of the robust parser is measured by
the metric IC, which is calculated as a percentage
using formula (8):

$$IC = 100 \left(1 - \frac{i + s + d}{items}\right) \tag{8}$$

where items is the total number of items in RTRI
and i, s, and d are the numbers of items inserted,
substituted, and deleted in PTRI, respectively.^3

Assume, for example, that the word graph in Figure 3,
whose RTRI is [dm_marker: yes, time: 10],
is parsed without predictions. This will produce two
partial results, namely er and um 10 uhr. The SIL
structure of the former cannot be mapped to a TRC.
Thus PTRI is [time: 10], i.e., d = 1 because one of
the RTRI items is deleted in PTRI. This yields an IC
of 100 (1 - 1/2) = 50%.
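
The IC calculation can be sketched in Python. The counting of insertions, substitutions, and deletions below is a simplifying assumption made for this illustration (one value per attribute, comparison by attribute name); the actual definition is given in [8]. The function names are hypothetical.

```python
def count_errors(rtri, ptri):
    """Return (inserted, substituted, deleted) for PTRI relative to RTRI.

    Assumption: both annotations are dictionaries with one value per attribute.
    """
    inserted = sum(1 for attr in ptri if attr not in rtri)
    deleted = sum(1 for attr in rtri if attr not in ptri)
    substituted = sum(1 for attr in rtri
                      if attr in ptri and ptri[attr] != rtri[attr])
    return inserted, substituted, deleted


def information_content(rtri, ptri):
    """IC = 100 * (1 - (i + s + d) / items), with items = number of RTRI items."""
    i, s, d = count_errors(rtri, ptri)
    return 100.0 * (1 - (i + s + d) / len(rtri))


rtri = {"dm_marker": "yes", "time": 10}

# Without predictions: er cannot be mapped to a TRC, only the time survives.
ic_without = information_content(rtri, {"time": 10})

# With predictions: both ja and um zehn uhr are selected.
ic_with = information_content(rtri, {"dm_marker": "yes", "time": 10})

print(ic_without, ic_with)
```

For the example utterance this gives 50% without predictions and 100% when both partial results are recovered.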

The parser was tested in stand-alone mode on 50 
word graphs generated by the Daimler-Benz word rec- 
ognizer [3]. The graphs had a density of 4 edges per 
spoken word and a word accuracy rate of 73.3%. 

To measure the impact of the predictions we first 
parsed the graphs without predictions. In the second 
setup, semantic predictions were handed over as an 
additional argument to the parser. The choice of the 
prediction was determined by the original dialogue 
context of the utterance. 

The results are shown in the following table, where 
ic-pr and ic+pr are the IC rates without and with 
predictions, respectively, and t-pr and t+pr are the 
corresponding average parse times (in seconds) taken 
on a SPARCstation 10. 



ic-pr    t-pr    ic+pr    t+pr
67.48    0.67    76.48    0.74



These figures show an increase of 9 percentage points in the
IC rate (from 67.48% to 76.48%) when
using contextual knowledge in the parser. Most of
this improvement was attained in the analysis of very
short, elliptical utterances typically provided as answers
to yes-no questions. In these cases, exemplified
in Figure 2, the linguistic grammar cannot contribute
much, so that only contextual expectations allow a
well-founded choice among competing alternatives.

^3 See [8] for a definition of how to calculate the number of
insertions, deletions, and substitutions.

5. CONCLUSION AND FURTHER WORK 

We have presented a mechanism for integrating 
contextual knowledge into the linguistic processor of 
a spoken dialogue system. The reported results show 
that the use of predictions can improve the IC rate of 
the parser. Most improvement is gained in the anal- 
ysis of short utterances. To determine the IC, the 
parsing results had to be inspected manually. In the 
near future an annotated test suite of word graphs 
will be built up in order to automate the evaluation. 

Furthermore, we intend to measure the IC given 
different processing time limits. As mentioned in sec- 
tion 3, the predictions are used as pragmatic seed 
definitions of the island parser. Thus partial results 
which are most relevant for understanding are built 
at an early stage of processing. Due to this strategy
we expect acceptable IC rates even with a strict
limit on processing time, as may be necessary in a
real-time system.